Overview

Dataset statistics

Number of variables: 4
Number of observations: 1000
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 31.4 KiB
Average record size in memory: 32.1 B

Variable types

Numeric: 2
Categorical: 2

Alerts

birthDate has a high cardinality: 516 distinct values (High cardinality)
nationality has a high cardinality: 54 distinct values (High cardinality)
birthDate is uniformly distributed (Uniform)
df_index has unique values (Unique)
statementID has unique values (Unique)

Reproduction

Analysis started: 2022-06-01 21:25:11.553460
Analysis finished: 2022-06-01 21:25:28.728519
Duration: 17.18 seconds
Software version: pandas-profiling v3.2.0
Download configuration: config.json

Variables

df_index
Real number (ℝ≥0)

UNIQUE

Distinct: 1000
Distinct (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2623.861
Minimum: 0
Maximum: 5374
Zeros: 1
Zeros (%): 0.1%
Negative: 0
Negative (%): 0.0%
Memory size: 7.9 KiB

Quantile statistics

Minimum: 0
5-th percentile: 270.95
Q1: 1319.75
Median: 2564
Q3: 3944.75
95-th percentile: 5093.25
Maximum: 5374
Range: 5374
Interquartile range (IQR): 2625

Descriptive statistics

Standard deviation: 1523.952175
Coefficient of variation (CV): 0.5808052237
Kurtosis: -1.165308102
Mean: 2623.861
Median Absolute Deviation (MAD): 1335
Skewness: 0.07968661155
Sum: 2623861
Variance: 2322430.232
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value               Count  Frequency (%)
1795                1      0.1%
2788                1      0.1%
1361                1      0.1%
3677                1      0.1%
3732                1      0.1%
3896                1      0.1%
2056                1      0.1%
1133                1      0.1%
2937                1      0.1%
4609                1      0.1%
Other values (990)  990    99.0%

Value  Count  Frequency (%)
0      1      0.1%
8      1      0.1%
11     1      0.1%
13     1      0.1%
23     1      0.1%
27     1      0.1%
32     1      0.1%
39     1      0.1%
47     1      0.1%
48     1      0.1%

Value  Count  Frequency (%)
5374   1      0.1%
5365   1      0.1%
5348   1      0.1%
5338   1      0.1%
5336   1      0.1%
5322   1      0.1%
5305   1      0.1%
5303   1      0.1%
5298   1      0.1%
5297   1      0.1%

statementID
Real number (ℝ≥0)

UNIQUE

Distinct: 1000
Distinct (%): 100.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 9.391612659 × 10^18
Minimum: 2.261760228 × 10^16
Maximum: 1.843121297 × 10^19
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 7.9 KiB

Quantile statistics

Minimum: 2.261760228 × 10^16
5-th percentile: 1.077589261 × 10^18
Q1: 4.859557379 × 10^18
Median: 9.457167358 × 10^18
Q3: 1.391643545 × 10^19
95-th percentile: 1.740639106 × 10^19
Maximum: 1.843121297 × 10^19
Range: 1.840859537 × 10^19
Interquartile range (IQR): 9.056878066 × 10^18

Descriptive statistics

Standard deviation: 5.269885424 × 10^18
Coefficient of variation (CV): 0.5611267857
Kurtosis: -1.183598067
Mean: 9.391612659 × 10^18
Median Absolute Deviation (MAD): 4.546394439 × 10^18
Skewness: -0.03695536229
Sum: 9.391612659 × 10^21
Variance: 2.777169238 × 10^37
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value                Count  Frequency (%)
1.894954634 × 10^18  1      0.1%
1.301353082 × 10^19  1      0.1%
1.808278939 × 10^18  1      0.1%
2.879329166 × 10^17  1      0.1%
1.361266249 × 10^19  1      0.1%
1.676381301 × 10^19  1      0.1%
1.368662257 × 10^18  1      0.1%
6.747802971 × 10^18  1      0.1%
4.111929883 × 10^18  1      0.1%
1.060704421 × 10^19  1      0.1%
Other values (990)   990    99.0%

Value                Count  Frequency (%)
2.261760228 × 10^16  1      0.1%
3.087674609 × 10^16  1      0.1%
4.134186759 × 10^16  1      0.1%
4.897217398 × 10^16  1      0.1%
6.045753032 × 10^16  1      0.1%
7.747382497 × 10^16  1      0.1%
9.664508252 × 10^16  1      0.1%
1.039521447 × 10^17  1      0.1%
1.138621581 × 10^17  1      0.1%
1.338930217 × 10^17  1      0.1%

Value                Count  Frequency (%)
1.843121297 × 10^19  1      0.1%
1.841922293 × 10^19  1      0.1%
1.841212592 × 10^19  1      0.1%
1.839990486 × 10^19  1      0.1%
1.839989714 × 10^19  1      0.1%
1.839910485 × 10^19  1      0.1%
1.839329021 × 10^19  1      0.1%
1.829387604 × 10^19  1      0.1%
1.828284347 × 10^19  1      0.1%
1.828166149 × 10^19  1      0.1%

birthDate
Categorical

HIGH CARDINALITY
UNIFORM

Distinct: 516
Distinct (%): 51.6%
Missing: 0
Missing (%): 0.0%
Memory size: 7.9 KiB

Value               Count
1985-01-01          6
1987-08-01          6
1961-08-01          5
1991-04-01          5
1981-02-01          5
Other values (511)  973

Length

Max length: 10
Median length: 10
Mean length: 10
Min length: 10

Characters and Unicode

Total characters: 10000
Distinct characters: 11
Distinct categories: 2
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 224
Unique (%): 22.4%

Sample

1st row: 1977-11-01
2nd row: 1976-04-01
3rd row: 1944-08-01
4th row: 1983-03-01
5th row: 1975-11-01

Common Values

Value               Count  Frequency (%)
1985-01-01          6      0.6%
1987-08-01          6      0.6%
1961-08-01          5      0.5%
1991-04-01          5      0.5%
1981-02-01          5      0.5%
1965-03-01          5      0.5%
1976-03-01          5      0.5%
1968-06-01          5      0.5%
1986-08-01          5      0.5%
1981-12-01          5      0.5%
Other values (506)  948    94.8%

Length

Histogram of lengths of the category
Value               Count  Frequency (%)
1985-01-01          6      0.6%
1987-08-01          6      0.6%
1981-12-01          5      0.5%
1975-01-01          5      0.5%
1985-09-01          5      0.5%
1974-01-01          5      0.5%
1980-12-01          5      0.5%
1976-01-01          5      0.5%
1978-06-01          5      0.5%
1985-04-01          5      0.5%
Other values (506)  948    94.8%

Most occurring characters

Value  Count  Frequency (%)
1      2509   25.1%
-      2000   20.0%
0      1976   19.8%
9      1272   12.7%
7      456    4.6%
8      426    4.3%
6      369    3.7%
5      312    3.1%
2      272    2.7%
4      229    2.3%

Most occurring categories

Value             Count  Frequency (%)
Decimal Number    8000   80.0%
Dash Punctuation  2000   20.0%

Most frequent character per category

Decimal Number
Value  Count  Frequency (%)
1      2509   31.4%
0      1976   24.7%
9      1272   15.9%
7      456    5.7%
8      426    5.3%
6      369    4.6%
5      312    3.9%
2      272    3.4%
4      229    2.9%
3      179    2.2%

Dash Punctuation
Value  Count  Frequency (%)
-      2000   100.0%

Most occurring scripts

Value   Count  Frequency (%)
Common  10000  100.0%

Most frequent character per script

Common
Value  Count  Frequency (%)
1      2509   25.1%
-      2000   20.0%
0      1976   19.8%
9      1272   12.7%
7      456    4.6%
8      426    4.3%
6      369    3.7%
5      312    3.1%
2      272    2.7%
4      229    2.3%

Most occurring blocks

Value  Count  Frequency (%)
ASCII  10000  100.0%

Most frequent character per block

ASCII
Value  Count  Frequency (%)
1      2509   25.1%
-      2000   20.0%
0      1976   19.8%
9      1272   12.7%
7      456    4.6%
8      426    4.3%
6      369    3.7%
5      312    3.1%
2      272    2.7%
4      229    2.3%

nationality
Categorical

HIGH CARDINALITY

Distinct: 54
Distinct (%): 5.4%
Missing: 0
Missing (%): 0.0%
Memory size: 7.9 KiB

Value              Count
GB                 843
RO                 16
IE                 14
PH                 14
PK                 12
Other values (49)  101

Length

Max length: 2
Median length: 2
Mean length: 2
Min length: 2

Characters and Unicode

Total characters: 2000
Distinct characters: 25
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 29
Unique (%): 2.9%

Sample

1st row: GB
2nd row: GB
3rd row: GB
4th row: GB
5th row: GB

Common Values

Value              Count  Frequency (%)
GB                 843    84.3%
RO                 16     1.6%
IE                 14     1.4%
PH                 14     1.4%
PK                 12     1.2%
DE                 9      0.9%
ES                 8      0.8%
PL                 7      0.7%
PT                 7      0.7%
TR                 6      0.6%
Other values (44)  64     6.4%

Length

Histogram of lengths of the category
Value              Count  Frequency (%)
gb                 843    84.3%
ro                 16     1.6%
ie                 14     1.4%
ph                 14     1.4%
pk                 12     1.2%
de                 9      0.9%
es                 8      0.8%
pl                 7      0.7%
pt                 7      0.7%
tr                 6      0.6%
Other values (44)  64     6.4%

Most occurring characters

Value              Count  Frequency (%)
G                  851    42.5%
B                  851    42.5%
P                  40     2.0%
E                  37     1.8%
R                  27     1.4%
T                  22     1.1%
I                  20     1.0%
K                  18     0.9%
H                  18     0.9%
L                  17     0.9%
Other values (15)  99     5.0%

Most occurring categories

Value             Count  Frequency (%)
Uppercase Letter  2000   100.0%

Most frequent character per category

Uppercase Letter
Value              Count  Frequency (%)
G                  851    42.5%
B                  851    42.5%
P                  40     2.0%
E                  37     1.8%
R                  27     1.4%
T                  22     1.1%
I                  20     1.0%
K                  18     0.9%
H                  18     0.9%
L                  17     0.9%
Other values (15)  99     5.0%

Most occurring scripts

Value  Count  Frequency (%)
Latin  2000   100.0%

Most frequent character per script

Latin
Value              Count  Frequency (%)
G                  851    42.5%
B                  851    42.5%
P                  40     2.0%
E                  37     1.8%
R                  27     1.4%
T                  22     1.1%
I                  20     1.0%
K                  18     0.9%
H                  18     0.9%
L                  17     0.9%
Other values (15)  99     5.0%

Most occurring blocks

Value  Count  Frequency (%)
ASCII  2000   100.0%

Most frequent character per block

ASCII
Value              Count  Frequency (%)
G                  851    42.5%
B                  851    42.5%
P                  40     2.0%
E                  37     1.8%
R                  27     1.4%
T                  22     1.1%
I                  20     1.0%
K                  18     0.9%
H                  18     0.9%
L                  17     0.9%
Other values (15)  99     5.0%

Interactions


Correlations


Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better than Pearson's r at capturing nonlinear monotonic correlations. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
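
As a rough illustration of the calculation described above, the following sketch ranks both variables and then divides the covariance of the ranks by the product of their standard deviations. It uses only the Python standard library; the names `average_ranks` and `spearman_rho` are hypothetical and not part of pandas-profiling, and giving tied values the mean of their positions is an assumed (though common) convention.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Covariance of the rank variables divided by the product of their
    standard deviations. Undefined (division by zero) for constant input."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = mean([(a - mx) * (b - my) for a, b in zip(rx, ry)])
    return cov / (pstdev(rx) * pstdev(ry))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # monotonic in x, but not linear
rho = spearman_rho(x, y)  # close to 1, up to floating-point rounding
```

Because only the ranks enter the calculation, any strictly increasing transformation of either variable leaves ρ unchanged.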

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that, for a linear relationship, the steepness of the line does not affect the magnitude of r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
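
The calculation above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the code the report itself runs: `pearson_r` is a hypothetical name, and population (divide-by-n) covariance and standard deviation are used, which gives the same r as the sample versions because the normalisation cancels.

```python
from math import sqrt


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard
    deviations. Undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # monotonic in x, but not linear

r_original = pearson_r(x, y)
# Shifting and rescaling x (by a positive factor) leaves r unchanged.
r_rescaled = pearson_r([3 * a + 7 for a in x], y)
```

Here `r_original` is below 1 because the relationship is monotonic but not linear, and `r_rescaled` matches it up to floating-point rounding.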

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
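
The pair-counting definition above corresponds to the variant usually called τ-a, and a minimal sketch of it follows. The name `kendall_tau_a` is hypothetical; tied pairs count as neither concordant nor discordant here, whereas statistical libraries commonly report a tie-adjusted variant (τ-b), so results may differ on data with ties.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = discordant = 0
    pairs = list(combinations(zip(x, y), 2))
    for (x1, y1), (x2, y2) in pairs:
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both variables order the pair the same way
        elif sign < 0:
            discordant += 1   # the variables order the pair oppositely
        # sign == 0 means a tie in x or y; it adds to neither count
    return (concordant - discordant) / len(pairs)


# Six pairs in total: five concordant, one discordant (the middle two swap).
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])  # (5 - 1) / 6
```

The quadratic loop over all pairs is fine for a sketch but becomes slow for large samples.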

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. There is extensive documentation available here.

Missing values

A simple visualization of nullity by column.
Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

Sample

First rows

     df_index  statementID           birthDate   nationality
0    1795      1894954633915448875   1977-11-01  GB
1    660       8122201078160437121   1976-04-01  GB
2    4167      14897321233292332437  1944-08-01  GB
3    3872      4900120917638339431   1983-03-01  GB
4    1468      2187005604204266902   1975-11-01  GB
5    5124      8477491518228188624   1982-12-01  GB
6    2389      11940734352805019661  1954-12-01  GB
7    1836      14593884978314155394  1969-09-01  GB
8    784       16910348632076779939  1982-06-01  GB
9    768       10872922384409726013  1997-01-01  GB

Last rows

     df_index  statementID           birthDate   nationality
990  4325      10301834450507381579  1957-10-01  GB
991  3989      3623805796509913094   1968-09-01  GB
992  1242      7922084366984616593   2001-09-01  GB
993  4835      2126511552858375391   1968-06-01  GB
994  3429      8975710324249148371   2003-07-01  LV
995  2138      17596704098375131751  1977-01-01  GB
996  3783      14218517060360418696  1969-02-01  GB
997  1222      14997462062646212603  1990-04-01  GB
998  4487      10081875072092782811  1982-03-01  GB
999  970       16980272473163185766  1981-08-01  RO